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Abstract 

Increasingly larger data sets of processes in space and time ask for statistical models and 
methods that can cope with such data. We show that solutions of stochastic advection- 
diffusion partial differential equations (SPDEs) provide a flexible model class for spatio- 
temporal processes which is computationally feasible also for large data sets. The solution 
of the SPDE has in general a nonseparable covariance structure. Furthermore, its parame- 
ters can be physically interpreted as explicitly modeling phenomena such as transport and 
diffusion that occur in many natural processes in diverse fields ranging from environmental 
sciences to ecology. In order to obtain computationally efficient statistical algorithms we use 
spectral methods to solve the SPDE. This has the advantage that approximation errors do not 
accumulate over time, and that in the spectral space the computational cost grows linearly 
with the dimension, the total computational costs of Bayesian or frequentist inference being 
dominated by the fast Fourier transform. The proposed model is applied to postprocessing of 
precipitation forecasts from a numerical weather prediction model for northern Switzerland. 
In contrast to the raw forecasts from the numerical model, the postprocessed forecasts are 
calibrated and quantify prediction uncertainty. Moreover, they outperform the raw forecasts. 



Keywords: spatio-temporal model, Gaussian process, physics based model, advection-diffusion 
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1 Introduction 



Space-time data arise in many applications, see Cressie and Wikle (2011) for an introduction 
and an overview. Increasingly larger space-time data sets are obtained, for instance, from 
remote sensing satellites or deterministic physical models such as numerical weather prediction 
(NWP) models. Statistical models are needed that can cope with such data. 

As Wikle and Hooten (2010) point out, there are two basic paradigms for constructing 
spatio-temporal models. The first approach is descriptive and follows the traditional geo- 



statistical paradigm, using joint space-time covariance functions (Cressie and Huang 1999 



Gneiting 


2002, 


Ma 


2003; 


Wikle 


2003, 


Stein 


2005 



approach is dynamic and combines ideas from time-series and spatial statistics (Solna and 



Switzer 1996; Wikle and Cressie 1999( Huang and Hsu 


2004 


2005; 


Johannesson et al. 


2007; 


Sigrist et al. 


2012 


>■ 



Even for purely spatial data, developing methodology which can handle large data sets 
is an active area of research. Banerjee et al. (2004) refer to this as the "big n problem". 



Factorizing large covariance matrices is not possible without assuming a special structure 
or using approximate methods. Using low rank matrices is one approach (Nychka et al. 



2002 Banerjee et al. 2008 Cressie and Johannesson 2008 Stein 2008 Wikle 



proposals include using Gaussian Markov random-fields (GMRF) (|Rue and Tjelmeland||2002 



2010). Other 



Rue and Held 2005 Lindgren et al. 2011) or applying tapering (Furrer et al. 2006) so that 



one obtains sparse precision or covariance matrices, respectively, for which calculations can 
be done efficiently. Another proposed solution is to approximate the likelihood so that it can 
be evaluated fast er QVecchia||198ll |Stein et al.||2004| |Fuentes|[2007] |Eidsvik et aLl[2012| ) . |Royle 



and Wikle (2005) and Paciorek| ( |2007 ) use Fourier functions to reduce computational costs. 

In a space-time setting, the situation is the same, if not worse: one runs into a com- 
putational bottleneck with high dimensional data since the computational costs to factorize 
dense NT x NT covariance matrices are 0((./VT) 3 ), N and T being the number of points in 
space and time, respectively. Moreover, specifying flexible and realistic space-time covariance 
functions is a nontrivial task. 

In this paper, we follow the dynamic approach and study models which are defined through 
a stochastic advection-diffusion partial differential equation (SPDE). This has the advantage 
of providing physically motivated parametrizations of space-time covariances. We show that 
when solving the SPDE using Fourier functions, one can do computationally efficient statisti- 
cal inference. In the spectral space, computational costs for the Kalman filter and backward 
sampling algorithms are of order O(NT). As we show, roughly speaking, this computational 
efficiency is due to the temporal Markov property, the fact that Fourier functions are eigen- 
functions of the spatial differential operators, and the use of some matrix identities. The 
overall computational costs are then determined by the ones of the fast Fourier transform 
(FFT) dCooley and Tukey|[l965l ) which are 0(TN logiV). 

Defining Gaussian processes through stochastic differential equations has a long history in 



as Whittle 


(1954), Heine 


(1955) 


(1997 


) and 


Brown et al. 


(2000) 



Recently, Lindgren 



et al. (2011) showed how a certain class of SPDEs can be solved using finite elements to 



obtain parametrizations of spatial GMRF. 

Spectral methods for solving partial differential equation are well established in the nu- 
merical mathematics community (see, e.g., Gottlieb and Orszag (1977), Folland (1992), or 



Haberman (2004)). In contrast, statistical models have different requirements and goals, since 



2 



the (hyper-)parameters of an (S)PDE are not known a priori and need to be estimated. Spec- 
tral methods have also been used in spatio-temporal statistics, mostly for approximating or 



solving deterministic integro-difference equations (IDEs) or PDEs. Wikle and Cressie ( 1999 ) 
introduce a dynamic spatio-temporal model obtained from an IDE that is approximated us- 



ing a reduced-dimensional spectral basis. Extending this work, Wikle (2002) and Xu et al. 



(2005) propose parametrizations of spatio-temporal processes based on IDEs. Modeling trop- 



ical ocean surface winds, Wikle et al. (2001) present a physics based model based on the 
shallow- water equations. See also (Cressie and Wikle 2011, Chapter 7) for an overview of 



basis function expansions in spatio-temporal statistics. 

The novel features of our work are the following. While spectral methods have been 
used for approximating deterministic IDEs and PDEs in the statistical literature, there is 
no article, to our knowledge, that explicitly shows how one obtains a space-time Gaussian 
process by solving an advection-diffusion SPDE using the real Fourier transform. Moreover, 
we present computationally efficient algorithms, for doing statistical inference, which use the 
fast Fourier transform and the Kalman filter. The computational burden can be additionally 
alleviated by applying dimension reduction. We also give a bound on the accuracy of the 
approximate solution. In the application, we postprocess precipitation forecasts, explicitly 
modeling spatial and temporal variation. The idea is that the spatio-temporal model not 
only accounts for dependence, but it also captures and extrapolates dynamically an error 
term of the NWP model in space and time. 

The remainder of this paper is organized as follows. Section [2] introduces the continuous 
space-time Gaussian process defined through the advection-diffusion SPDE. In Section [3j it 
is shown how the solution of the SPDE can be approximated using the two-dimensional real 
Fourier transform, and we give convergence rates for the approximation. Next, in Section |4j 
we show how one can do computationally efficient inference. In Section [5j the spatio-temporal 
model is used as part of a hierarchical Bayesian model, which we then apply for postprocessing 
of precipitation forecasts. 

All the methodology presented in this article is implemented in the R package spate (see 



Sigrist et al. (2012)). 



2 A Continuous Space-Time Model: The 
Advection-Diffusion SPDE 

In one dimension, a fundamental process is the Ornstein-Uhlenbeck process which is governed 
by a relatively simple stochastic differential equation (SDE). The process has an exponen- 
tial covariance function and its discretized version is the famous AR(1) model. In the two 



dimensional spatial case, Whittle (1954) argues convincingly that the process with a Whittle 



correlation function is an "elementary" process (see Section 2.2 for further discussion). If the 



time dimension is added, we think that the process defined through the stochastic partial 
differential equation (SPDE) in ([!]) has properties that make it a good candidate for an "ele- 
mentary" spatio-temporal process. It is a linear equation that explicitly models phenomena 
such as transport and diffusion that occur in many natural processes ranging from environ- 
mental sciences to ecology. This means that, if desired, the parameters can be given a physical 
interpretation. Furthermore, if some parameters equal zero (no advection and no diffusion), 
its covariance structure reduces to a separable one with an AR(1) structure over time and a 
certain covariance structure over space. 
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The advection-diffusion SPDE, also called transport-diffusion SPDE, is given by 



J^(t, s) = -ti- V£(i, s) + V • £V£(i, «) - tf(t, s) + e(i, a), (1) 

with s = (x,y)' G M 2 , where V = ^g^r? g^J is the gradient operator, and, for a vector 

field F = (F x ,F y )', V • F = + is the divergence operator. e(i, s) is a Gaussian 
process that is temporally white and spatially colored. See Section |2.2| for a discussion on 
the choice of the spatial covariance function. The SPDE has the following interpretation. 
Heuristically, an SPDE specifies what happens locally at each point in space during a small 
time step. The first term /i. • V£(t, s) models transport effects (called advection in weather 
applications), n = (fi x , fi y )' being a drift or velocity vector. The second term, V • SV£(t, s), 
is a diffusion term that can incorporate anisotropy. If £1 is the identity matrix, this term 
reduces to the divergence (V-) of the gradient (V) which is the ordinary Laplace operator 
V • V = A = + ^2 • The third term — C£(i, s) diminishes £(t, s) at a constant rate and 
thus accounts for damping. Finally, e(t, s) is a source-sink or stochastic forcing term, also 
called innovation term, that can be interpreted as describing, amongst others, convective 
phenomena in precipitation modeling applications. 

Concerning the diffusion matrix X, we suggest the following parametrization 

_ 1 / cos a sin a \ / cos a sin a \ , > 

Pi \~ T ' s * n a 1 ' cos a J ' s ™ a 7 ' cos a J ' 

The parameters are interpreted as follows. p\ acts as a range parameter and controls the 
amount of diffusion. The parameters 7 and a control the amount and the direction of 
anisotropy. With 7 = 1, isotropic diffusion is obtained. 



Heine (1955) and Whittle (1963) introduced and analyzed SPDEs of similar form as in 



1). Further, Jones and Zhang (1997) also investigated SPDE based models. Brown et al. 



(2000) obtained such an advection-diffusion SPDE as a limit of stochastic integro-difference 



equation models. Without giving any concrete details, Lindgren et al. (2011) suggested that 
this SPDE can be used in connection with their GMRF method. 

Figure [T] illustrates the SPDE in ([T]) and the corresponding PDE without the stochastic 
innovation term. The top row shows a solution to the PDE which corresponds the deter- 
ministic part of the SPDE that is obtained when there is no stochastic term e(t, s). The 
figure shows how the initial state in the top-left plot gets propagated forward in time. The 
drift vector points from north-east to south-west and the diffusive part exhibits anisotropy in 
the same direction. A 100 x 100 grid is used and the PDE is solved in the spectral domain 
using the method described below in Section [3j There is a fundamental difference between 
the deterministic PDE and the probabilistic SPDE. In the first case, a deterministic process 
is modeled directly. In the second case, the SPDE defines a stochastic process. Since the op- 
erator is linear and the input Gaussian, this process is a Gaussian process whose covariance 
function is implicitly defined by the SPDE. The bottom row of Figure [T] shows one sample 
from this Gaussian process. The same initial state as in the deterministic example is used, i.e., 
we use a fixed initial state. For the innovations e(t, s), we choose a Gaussian process that is 
temporally independent and spatially structured according to the Matern covariance function 
with smoothness parameter 1. Again, the drift vector points from north-east to south-west 
and the diffusive part exhibits anisotropy in the same direction. 
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Figure 1. Illustration of the SPDE in ([I]) and the corresponding PDE. The top row illustrates a 
solution to the PDE which corresponds the deterministic part of the SPDE without stochastic 
term e(t, s). The bottom row shows one sample from the distribution specified by the SPDE 
with a fixed initial condition. The drift vector points from north-east to south-west and the 
diffusive part exhibits anisotropy in the same direction. 



Note that the use of this spatio-temporal Gaussian process is not restricted to situations 
where it is a priori known that phenomena such as transport and diffusion occur. In the 
one dimensional case, it is common to use the AR(1) process in situations where it is not 
a priori clear whether the modeled process follows the dynamic of the Ornstein-Uhlenbeck 
SDE. In two dimensions, the same holds true for the process with the Whittle covariance 
function, not to mention the process having an exponential covariance structure. Having this 
in mind, even though the SPDE in ([I]) is physically motivated, it can be used as a general 
spatio-temporal model. As the case may be, the interpretation of the parameters can be more 
or less straightforward. 



2.1 Spectral Density and Covariance Function 



As can be shown using the Fourier transform (see, e.g., Whittle (1963)), if the innovation 
process e(t, s) is stationary with spectral density f(k), the spectrum of the stationary solution 
Z(t,s) of the SPDE gj is 



f(u,k) = f(k) 



(2*0 



(3) 



where k and u are spatial wavenumbers and temporal frequencies. The covariance function 
C(t, s) of £(t, s) is then given by 



C(t,s)= J /(cj,fc)exp (itui)exp (is'k^dkdu) 

^exp (-ip'kt - {k'Y,k + Q\t\) 
2(k"Sk + () 



/(*)- 



exp (is'k)dk, 



(4) 



where the integration over the temporal frequencies to follows from the calculation of the 
characteristic function of the Cauchy distribution ( Abramowitz and Stegun|1964 ). The spatial 
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integral above has no closed form solution but can be computed approximately by numerical 
integration. 

Since, in general, the spectrum does not factorize into a temporal and a spatial component, 
we see that £(£, s) has a non-separable covariance function (see Gneiting et al. (2007b) for 
a definition of separability). The model reduces to a separable one, though, when there is 
no advection and diffusion, i.e., when both fi and X are zero. In this case, the covariance 
function is given by C(t,s) = ^exp (— (\t\)C(s), where C(s) denotes the spatial covariance 
function of the innovation process. 



2.2 Specification of the Innovation Process 



As mentioned before, it is assumed that the innovation process is white in time and spatially 
colored. In principle, one can choose any spatial covariance function such that the covariance 
function in Q is finite at ze ro. Note that if f{k) is integrable, then f(co, k) is also integrable. 
Similarly as Lindgren et al. (2011), we opt for the most commonly used covariance function 



in spatial statistics: the Matern covariance function (see Handcock and Stein (1993), Stein 



(1999)). Since in many applications the smoothness parameter is not estimable, we further 
restrict ourselves to the Whittle covariance function. This covariance function is of the form 
a 2 d/ pqK\ (d/po) with d being the Euclidean distance between two points and K\ (d/po) being 
the modified Bessel function of order 1. It is called after Whittle (1954) who introduced 



it and argued convincingly that it "may be regarded as the 'elementary' correlation in two 
dimensions, similar to the exponential in one dimension." . It can be shown that the stationary 
solution of the SPDE 

1 



V- V 



pi 



e(t,s) = W(t,s) 



(5) 



where W(t, s) is a zero mean Gaussian white noise field with variance a 2 , has the Whittle 
covariance function in space. From this, it follows that the spectrum of the process e(t, s) is 
given by 



/(*) 



(2nf 



k'k + 



(6) 



2.3 Relation to an Integro-Difference Equation 



Assuming discrete time steps with lag A, Brown et al. (2000) consider the following integro- 
difference equation (IDE) 

s) = exp (-AC) / h(s-s')£(t- A,s)ds' + e(t,s), seR 2 , (7) 

with a Gaussian redistribution kernel 

h(s - s') = (2vr)- 1 |2AS|- 1/2 exp (-(s - s' - A/x) T (2AS)^(s - s' - A/x)/2) , 

e(t, s) being temporally independent and spatially dependent. They show that in the limit 
A — > 0, the solution of the IDE and the one of the SPDE in ([I]) coincide. The IDE is 
interpreted as follows: the convolution kernel h(s — s') determines the weight or the amount 
of influence that a location s' at previous time t — A has on the point s at current time t. 
This IDE representation provides an alternative way of interpreting the SPDE model and its 
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parameters. Storvik et al. ( 2002 ) show under which conditions a dynamic model determined 
by an IDE as in ([7]) can be represented using a parametric joint space-time covariance function, 



and vice versa. Based on the IDE in ff\ , |Sigrist et al. (2012 ) construct a spatio-temporal model 



for irregularly spaced data and apply it to obtain short term predictions of precipitation. 

3 Solution in the Spectral Space 

Solutions £(t, s) of the SPDE are defined in continuous space and time. In practice, one 
needs to discretize both space and time. The resulting vector of NT space-time points is in 
general of large dimension. This makes statistical inference, be it frequentist or Bayesian, 
computationally difficult to impossible. However, as we show in the following, solving the 
SPDE in the spectral space alleviates the computational burden considerably and allows for 
dimension reduction, if desired. 



Heuristically speaking, spectral methods (Gottlieb and Orszag (1977), Cressie and Wikle 



(2011 ([Chapter 7]) approximate the solution !;(t,s) by a linear combination of deterministic 



spatial functions 4>ji s ) with random coefficients cxj(t) that evolve dynamically over time: 

K 

£ K (t,s) = = 0(a)'a(t), (8) 

where 4>(s) = (cj)i(s), . . . ,4>k(s))' and at(t) = . . . , olk(£))' ■ As we argue below, it is 

expedient to use Fourier functions 

4>j(s) = exp (ik'jS), (9) 

where i denotes the imaginary number i 2 = — 1 and kj = (kj, k v A' is a spatial wavenumber. 

For solving the SPDE in ([!]), Fourier functions have several advantages. First, differentia- 
tion in the physical space corresponds to multiplication in the spectral space. In other words, 
Fourier functions are eigenfunctions of the spatial differential operator. Instead of approxi- 
mating the differential operator in the physical space and then worrying about approximation 
errors, one just has to multiply in the spectral space, and there is no approximation error 
of the operator. Further, the solution of the SPDE is exact over time, given the frequencies 
included, see Proposition 1 below. In contrast to finite differences, one does not accumulate 
errors over time. The approximation error only depends on the number of spectral terms and 
not on the temporal discretization. This is related to the fact that there is no need for nu- 
merical stability conditions. For statistical applications, where the parameters are not known 
a priori, this is particularly useful. Finally, one can use the FFT for efficiently transforming 
from the physical to the spectral space, and vice versa. However, since Fourier terms are 
global functions, stationarity in space but not in time is a necessary assumption. 

The following result shows that if the initial condition and the innovation process are in 
the space spanned by a finite number of Fourier functions, then the solution of the SPDE ([!]) 
remains in this space for all times and can be given in explicit form. 

Proposition 1. Assume that the initial state and the innovation terms are of the form 

£*(0,a) = #a)'a(0), e K (t, s) = 0(s)'e(t) (10) 
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where <f>(s) = (cj)i(s), . . . , 4>k( s ))' , 4>j( s ) i s given m[5| ande(t) is a K -dimensional Gaussian 
white noise independent of a(0) with 

Cov(e(t), e(0) = <Wdiag (f(kjj) , (11) 



where f is a spectral density. Then the process £ K (t,s) = d>(s)'a.(t) where the components 
ctj(t) are given by 

acj(t) = exp (hjt)a>j(0) + / exp (hj(t — s))'ej(s)ds (12) 

Jo 

with hj = —ifji'kj — k'jJjkj — Q is a solution of the SPDE in ([!]). For t — > oo, the influence 
of the initial condition exp(hjt)aj(0) converges to zero and the process £(t,s) converges to a 
stationary Gaussian process with mean zero and 

Co^ K (t + At,s),i K (t, 8 ')) = 0(s)'diag ( ""^ff/^^ ) 0(»T. 
where .* stands for complex conjugation. 



Proof of Proposition^ By (12), we have 

K K 

dt 



9 C K (t,s) =J>i(*)&00 = Y J (h j a j {t) + e j {t))(t> 3 (s). 

3=1 3=1 



On the other hand, since the functions 4>j(s) = exp(ik'js) are Fourier terms, differentiation 
in the physical space corresponds to multiplication in the spectral space: 

(i ■ Vcf)j(s) = ifi'kj(f)j(s) (13) 

and 

V • SV^(a) = -fcJ.Efej^(a). (14) 
Therefore, by the definition of hj, 

K K 

(-/x • v + v • sv - c) 5>i(*)&00 = I'j'yn'iojis). 

3=1 3=1 

Together, we have 

~£ K (t, s) = Hu • V + V • SV - C) s) + e K (t, s) 

which proves the first part of the proposition. Since the real part of hj is negative, exp(hjt) — > 
for t — > oo. Moreover, 

lim Cov (a j(t + At), ai(t)) = lim exp (hjAt)5jif(kj) / exp (—(hj + h*)(t — s))ds 

t— ><X) " t— >00 ' Jq 

_ exp ( hj At) f ' ^ 
" hj+h * 

and thus the last statement follows. □ 
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We assume that the forcing term e(t, .), and consequently also the solution £(t, .), is 
stationary in space. Recall the Cramer representation for a stationary field e(t, .) 

where % has orthogonal increments Cov(dei(fe), de t >(l)) = d~t,t>$k,if{k) and / is the spectral 



density of e(t, .) (see, e.g., Cramer and Leadbetter (1967)). This implies that we can approx 



imate any stationary field, in particular also the one with a Whittle covariance function, by 
a finite linear combination of complex exponentials, and the covariance of e(t) is a diagonal 
matrix as required in the proposition. 

3.1 Approximation bound 

By passing to the limit K — ¥ oo such that both the wavenumbers kj cover the entire domain 
M. 2 and the distance between neighboring wavenumbers goes to zero, we obtain from Q the 
stationary (in space and time) solution with spectral density as in ([3]). In practice, if one 
uses the discrete Fourier transform (DFT), or its fast variant, the FFT, the wavenumbers are 
regularly spaced and the distance between them is fixed for all K (see below). This implies 
that the covariance function of an approximate solution is periodic which is equivalent to 
assuming a rectangular domain being wrapped around a torus. Since in most applications, 
the domain is fixed anyway, this is a reasonable assumption. 

Based on the above considerations, we assume, in the following, that s G [0, l] 2 with 
periodic boundary condition, i.e., that [0, l] 2 is wrapped on a torus. In practice, to avoid 
spurious periodicity, we can apply what is called "padding". This means that we take s G 
[0,0. 5] 2 and then embed it in [0, l] 2 . As in the discrete Fourier transform, if we choose 
s G [0, l] 2 , it follows that the spatial wavenumbers fe, lie on the n x n grid given by D n = 
{2vr • (i,j) : -(rt/2 + 1) < i,j < n/2} = {-2vr(n/2 + 1), . . . ,2vrn/2} 2 with n 2 = N = K, n 
being an even natural number. We then have the following convergence result. 

Proposition 2. The approximation £, N (t,s) converges in law to the solution ^(t,s) of the 
SPDE with s G [0, l] 2 wrapped on a torus, and we have the bound 

\C(t,s)-C N (t,s)\<o-l-a 2 N , (16) 

where C(t, s) and C N (t, s) denote the covariance functions of£(t,s) and £ N (t, s), respectively, 
and where <r 2 = C(0,0) and a^ N = C N (0,0) denote the marginal variances of these two 
processes. 

Proof of Proposition Similarly as in Q and due to k G 2tt ■ Z 2 , it follows that covariance 
function of £(i, s) is given by 

C(t,s)= / f(oj, k)exp (ituj)duj exp(is' k) 
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where hf~ = —ifi'k — fc'Sfc — £. From Proposition [T] we know that the approximate solution 
£ N (t, s) has the covariance function 



c N (t, S )= y: /( fc ) T+y ) exp( ^ /fc) - 

k£D n k k 



(18) 



It follows that 



\C(t,s)-C N (t,s 



\- ? exp(h k t) , 
Z> h +h* ( 1 - 1 {fc£gn}) ex P(^ fc , 



= <7£ — CT^iv. 



(19) 



□ 



Not surprisingly, this result tells us that the rate of convergence essentially depends on 
the smoothness properties of the process £(t,s), i.e., on how fast the spectrum decays. The 
smoother £(t, s), that is, the more variation is explained by low frequencies, the faster is the 
convergence of the approximation. 

Note that, there is a conceptual difference between the stationary solution of the SPDE 
([I]) with s € K 2 and the periodic one with s E [0, l] 2 wrapped on a torus. For the sake of 
notational simplicity, we have denoted both of them by s). The finite dimensional solution 
£ (t, s) is an approximation to both of the above infinite dimensional solutions. The above 
convergence result, though, only holds true for the solution on the torus. 



3.2 Real Fourier Functions and Discretization in Time and Space 

In order that we can apply the model to real data, we have to discretize it. In the following, 
we consider the process £(t, s) on a regular grid of n x n = N spatial locations s±, . . . , sjy 
in [0, l] 2 and at equidistant time points ti, . . . , with t{ — = A. Note that these two 
assumptions can be easily relaxed, i.e., one can have irregular spatial observation locations and 
non-equidistant time points. The former can be achieved by adopting a data augmentation 



approach (see, for instance, Sigrist et al. (2012)) or by using an incidence matrix (see Section 



4.2). The latter can be done by taking a time varying A. 

For the sake of illustration, we have stated the results in the previous section using complex 
Fourier functions. However, when discretizing the model, one obtains a linear Gaussian state 
space model with a propagator matrix G that contains complex numbers, due to (13). To 



avoid this, we replace the complex terms exp (ikjs) with real cos(kjs) and sin(fc„s) functions. 
In other words, we use the real instead of the complex Fourier transform. The above results 
then still hold true, since for real valued data, the real Fourier transform is equivalent to the 
complex one. For notational simplicity, we will drop the superscript llK " from £ K (t,s). The 
distinction between the approximation and the true solution is clear from the context. 

Proposition 3. On the above specified discretized spatial and temporal domain and using the 
real Fourier transform, a stationary solution of the SPDE ([!]) is of the form 

£(tt+i) = *a(*i+i), (20) 
a(t w ) = Ga(t i ) + e(t i+1 ), e(t i+1 ) ~ N(0,Q), (21) 
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with stacked vectors £(ti) = (£(ti, s{), . . . ,£(ti, sjv))' and cosine and sine coefficients ct(ti) = 
(a^\ti),...,a^\ti),a^\ti) } a^ , where * applies the dis- 

crete, real Fourier transformation, G is a block diagonal matrix with 2x2 blocks, and Q is 
a diagonal matrix. These three matrices are defined as follows. 

• * = [0(si),...,0Oiv)]', 



4>^ } {si) = cos(k'jSi), (f)f\si) = sin(fe^), 
[G]i :4 ,i:4 = diag (exp (-A(fcJ-Sfcj + C))) , 

[G]b:K,5-.K = diag (exp (— A(kjHkj + C)) (cos(A/x / fc J )l 2 — sin(A/x'fcj) J 2)) , where 



1 2 



1 
1 



1 
-1 



(22) 



In summary, at each time point t and spatial point s the solution £(t, s) is the discrete 
real Fourier transform of the random coefficients oc(t) 



4 K/2+2 



0(a)'a(t), 



j=5 



(23) 



and the Fourier coefficients a(t) evolve dynamically over time according to the vector autore- 



gression in (21). The above equations (20) and (21) form a linear Gaussian state space model 



with parametric propagator matrix G and innovation covariance matrix Q, the parametriza- 
tion being determined by the corresponding SPDE. 

Proof of Proposition^ Using 

V • £V0$ c) (s) = -k'^k^fis), V • EV^(fl) = -k'^k j( j)f{s), 

and the same arguments as in the proof of Proposition [TJ it follows that the continuous time 
solution 
we have 



solution is of the form (23). For each pair of cosine - sine coefficients otj(t) = (cr- (i), a- (t))' 



J 



(s)ds, 



(24) 



where 



, kjTikj Q 11 kj 

1 H'kj -k'jSkj-C 



Now Hj can be written as 



H 



-k'jSkj - C)l 2 - n'kj J 2 , 
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where 



l2"(J J), J-2 



1 
-1 



Since I2 and J 2 commute, we have 

e 11 ^ =exp (-t(kj^kj + C)l2)exp (-tp'kjJz) 

=exp (— t(kj'Skj + £)) (cos(t/x'fcj)l2 — sm(tfi'kj) J 2) ■ 

Further, the covariance of the stochastic integral in p4|) is given by 



ft ft 

/ e H ^ t -^f(k j )e H '^ t - s) ds= / f(kj)exp (-2(^.5^ + ()(t - s))l 2 ds 
Jo Jo 



1 - exp (-2(^Sfe i + C)i) 
2(fc$£fc 3 - + C) 1 



Discretizing in time and space then leads to the result in (20) and (121 



□ 



Spatial wavenumbers 
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kx/2pi 



Figure 2. Illustration of spatial wavenumbers for the two-dimensional discrete real Fourier 
transform with n 2 = 400 grid points. 



The discrete complex Fourier transform uses n 2 different wavenumbers kj each having 
a corresponding Fourier term exp(ife„s). The real Fourier transform, on the other hand, 
uses n 2 /2 + 2 different wavenumbers, where four of them have only a cosine term and the 
others each have sine and cosine terms. This follows from the fact that, for real data, certain 
coefficients of the complex transform are the complex transpose of other coefficients. For 



technical details on the real Fourier transform, we refer to Dudgeon and Mersereau (1984), 
Borgman et all ft 19841) , iRoyle and Wiklel (|2005h , and iPaciorekl (120071) . Figure [2| illustrates an 
example of the spatial wavenumbers, with n = 20 x 20 = 400 grid points. The dots with 
a circle represent the wavenumbers actually used in the real Fourier transform, and the red 
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crosses mark the wavenumbers having only a cosine term. Note that in (23) we choose to 



order the spatial wavenumbers such that the first four spatial wavenumbers correspond to 
the cosine-only terms. To get an idea how the basis functions cos(k'js) and sin(fc 3 s) look 

|-V^III = 




0.5 1 



0.5 1 



0.5 1 



0.5 1 



0.5 1 



0.5 1 



Figure 3. Illustration of two dimensional Fourier basis functions used in the discrete real 
Fourier transform with n 2 = 400. On the x- and y-axis are the coordinates of s. 

like, we plot in Figure [3] twelve low- frequency basis functions corresponding to the six spatial 
frequencies closest to the origin (see Figure [2]). Further, in Figure [4| there is an example of 

Propagator matrix G 
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en 
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Column 
Dimensions: 16 x 16 



Figure 4. Illustration of propagator matrix G. 16 real Fourier functions are used (n = 4). 

a propagator matrix G when n = 4, i.e., when sixteen (4 2 ) spatial basis functions are used. 
The upper left 4x4 diagonal matrix corresponds to the cosine-only frequencies. The 2x2 
blocks following correspond to wavenumbers with cosine - sine pairs. 
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3.3 Remarks on Finite Differences 



Another approach to solve PDEs or SPDEs such as the one in ([I]) consists of using a dis- 
cretization such as finite differences. Stroud et al. (2010) use finite differences to solve an 



advection-diffusion PDE. Other examples are Wikle (2003), Xu and Wikle (2007), Duan 



et al. (2009), Malmberg et al. (2008), and Zheng and Aukema (2010). The finite difference 



approximation, however, has several disadvantages. First, each spatial discretization effec- 
tively implies an interaction structure between temporal and spatial correlation. In other 
words, as Xu et al. (2005) state, the discretization effectively suggests a knowledge of the 



scale of interaction, lagged in time. Usually, this space-time covariance interaction structure 
is not known, though. Furthermore, there are numerical stability conditions that need to be 
fulfilled in order that the approximate solution is meaningful. Since these conditions depend 
on the values of the unknown parameters, one can run into problems. 

In addition, computational tractability is an issue. In fact, we have tried to solve the 
SPDE in using finite differences as described in the following. A finite difference ap- 
proximations in ([!]) leads to a vector autoregressive model with a sparse propagator matrix 
being determined by the discretization. The innovation term e can be approximated using 
a Gaussian Markov random field with sparse precision matrix (see Lindgren et al. (2011)). 
Even though the propagator and the precision matrices of the innovations are sparse, we have 
run into a computational bottleneck when using the Forward Filtering Backward Sampling 
(FFBS) algorithm (Carter and Kohn 1994; Friihwirth-Schnatter 1994) for fitting the model. 
The basic problem is that the Kalman gain is eventually a dense matrix. Alternative sam- 
pling schemes like the information filter (see, e.g., Anderson and Moore (1979) and Vivar 



and Ferreira (2009)) did not solve the problem either. However, future research on this topic 



might come up with solutions. 



4 Computationally Efficient Statistical Inference 



As said in the introduction, when using a naive approach, the computational costs for a 
spatio-temporal model with T time points and N spatial points equal 0((NT) s ). Using the 



Kalman filter or the Forward Filtering Backward Sampling (FFBS) algorithm (Carter and 
Kohn||1994 ; |Friihwirth-Schnatter||1994 ), depending on what is needed, these costs are reduced 
to 0(TN^) which, generally, is still too high for large data sets. In the following, we show how 
Bayesian and frequentist inference can be done efficiently in 0(TN log N) operations. In the 
spectral space, the costs of the algorithms grow linearly in the dimension TN, which means 
that the total computational costs are dominated by the costs of the fast Fourier transform 



(FFT) (Cooley and Tukey 1965) which are 0(TN log A~). 



As is often done in a statistical model, we add a non-structured Gaussian term v(ti+\, s) ~ 
N(0,T 2 ),iid, to (20) to account for small scale variation and / or measurement errors. In 



geostatistics, this term is called nugget effect. Denoting the observations at time U by w(ti), 
we then have the following linear Gaussian state space model: 



w(t i+1 ) = &ct(t i+1 ) + v(t i+ i) 
a(t i+1 ) = Ga(t t ) + e(t i+1 ), 



u(t i+1 ) ~ N{0,t 2 1 n ), 
e(i m )~ AT(0,Q). 



(25) 



Note that £(i 



i+l) 



i+l J 



As mentioned before, irregular spatial data can be modeled 



by adopting a data augmentation approach (see Sigrist et al. (2012)) or by using an incidence 
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matrix (see Section 4.2). For the sake of simplicity, a zero mean was assumed. Extending the 
model by including covariates in a regression term is straightforward. Furthermore, we assume 
normality. The model can be easily generalized to allow for data not following a Gaussian 
distribution. For instance, this can be done by including it in a hierarchical Bayesian model 
(HBM) ( |Wikle et al.|l998| > and specifying a non- Gaussian distribution for w\£. A frequentist 
approach is then typically no longer feasible. But approximate posterior probabilities can 
still be computed using, for instance, simulation based methods such as Markov chain Monte 
Carlo (MCMC) (see, e.g., |Gilks et al.| (|1996[) or|Robert and Casella| fl2004[)). An additional 



advantage of BHMs is that these models can be extended, for instance, to account for temporal 
nonstationarity by letting one or several parameters vary over time. 



4.1 Kalman Filtering and Backward Sampling in the Spectral Space 

In a frequentist context, it is crucial that one is able to evaluate the likelihood with a rea- 
sonable computational effort, and when doing Bayesian inference, one needs to be able to 
simulate efficiently from the full conditional of the latent process [£|-], or, equivalently, the 
Fourier coefficients [&{■]■ Below, we show how both these tasks can be done in the spectral 
space in linear time, i.e., using 0(TN) operations. For transforming between the physical 
and spectral space, one can use the FFT which requires 0(TN log N) operations. We start 
with the spectral version of the Kalman filter. Its output is used for both evaluating the 
log-likelihood and for simulating from the full conditional of the coefficients ex. 

Algorithm 1 Spectral Kalman filter 
Input: T,w, G, r 2 , Q, H 

Output: forecast and filter means m ti u._ 1 , rn t .\ ti and covariance matrices i^i^, -R ti |t i _ 1 , 
i = l,...,T 

m t \t = 

^o|to = Q 

for i = 1 to T do 

Rti\u-i = Q + Rti-^t^H 
R u\u = {t~ 2 i n + Rtj^y 1 

m n\u = m U\u^ + r 2 R U \ U [w(pi) - mtii^J 
end for 



Algorithm [T] shows the Kalman filter in the spectral space. For the sake of simplicity, we 
assume that the initial distribution equals the innovation distribution. The spectral Kalman 
filter has as input the Fourier transform of w = (w(ti)' , . . . , w^t)')' ofw, the diagonal matrix 
H given by 



[H]is4, 1: 4 = diag (exp (-2A(A$EAy + C))) , 
[H} 5:N , 5:N = diag (exp (-2A(fcjEfej + C))l 2 ) , 



(26) 



and other parameters that characterize the SPDE model. It returns forecast and filter means 
m ti | t ._ a and m ti u. and covariance matrices Rt^ ti and R t .\ t ._ l , i = 1, . . . ,T, respectively. Note 
that we follow the notation of Kiinsch (2001). Since the matrices Q and H are diagonal, the 
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covariance matrices RtMi an d RtAti-i are ^ so diagonal. Note that the matrix notation 
in Algorithm [T] is used solely for illustrational purpose. In practice, matrix vector products 
(Gm.^ i _ 1 k i _ 1 ), matrix multiplications (R t ._ 1 u._ 1 H), and matrix inversions ((t -2 +R t .u._ 1 )~ 1 ) 
are not calculated with general purpose algorithms but elementwise since all matrices are 
diagonal. It follows that the computational costs for this algorithm are 0(TN). 



The derivation of this algorithm follows from the classical Kalman filter (see, e.g., Kiinsch 
( |2001[ )) using = 1 N , GR t ._ l]t ._ 1 G' = R t ._ 1 \ u _ 1 GG' , and the fact that GG' = H. The 
first equation holds true due to the orthonormality of the discrete Fourier transform. The 
second equation follows from the fact that G is 2 x 2 block diagonal and that Rt i _ 1 \t i ^ 1 ls 
diagonal with the diagonal entries being equal for each cosine - sine pair. The last equation 
holds true as shown in the following. Being obvious for the first four frequencies, we consider 
the 2x2 diagonal blocks of cosine - sine pairs: 

[G ! ](2i-5):(2/-4),(2«-5):(2«-4)[G ! ](2i_5) : (2i-4),(2/-5):(2/-4) 

= exp (-2A(k' j 'Ekj + ()) (cos(A/Lt'fcJl 2 - sin(A//fcJJ 2 ) (cos(A//fcJl 2 - sin(A//fcJ J 2 )' 
= exp (-2A(k' j 'Ek j + ()) (cos{An'kj) 2 + sin(A/x'fc j ) 2 ) 1 2 , 



1 = 5,..., N/2 + 2, which equals (26). In the last equation we have used 

J 2 = — J 2 and j\ = — 1 2 . 



Based on the Kalman filter, the log-likelihood is calculated as (see, e.g., Shumway and 
StofferlpOOOl )) 



T 

£ = ^2log + t 2 1 n \ + (w(U) - mt.it._J' [Rtifc-i + ^InT 1 (w(ti) - m^_J + — log 

i=l 

Since the forecast covariance matrices R t .\ t ._ 1 are diagonal, calculation of their determinants 
and their inverses is trivial, and computational costs are again 0(TN). 



Algorithm 2 Spectral backward sampling 

Input: T, G, Q, H, m*.^, m ti | ti , Rt^., R^U-v i = h---,T 
Output: a sample a*(tx), . . . , a*(ir) from [at\-] 

a*(t T ) =m tT \ tT + (R tTltT ) 1/2 n T , n T ~ N(0, 1 N ) 
for % = T - 1 to 1 do 
™U = ™ ti \ ti + RuluR^G' (oc*(t i+1 ) - m^i^.J 

R ti = (QH + R^ ilti _y 

a*(U) = rn u + {R u ) 1/2 th, n< ~ A(0, 1 N ) 
end for 



As said, in a Bayesian context, the main difficulty consists in simulating from the full 
conditional of the latent coefficients [&{■]■ After running the Kalman filter, this can be done 
with a backward sampling step. Together, these two algorithms are know as Forward Filtering 
Backward Sampling (FFBS). Again, backward sampling is computationally very efficient in 
the spectral space with costs being 0(TN). Algorithm [2] shows the backward sampling 
algorithm in the spectral space. The matrices R^ are diagonal which makes their Cholesky 
decomposition trivial. 
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4.2 Dimension Reduction and Incidence Matrix 



If desired, the total computational costs can be additionally alleviated by using a reduced 
dimensional Fourier basis with K « N. This means that one includes only certain frequen- 
cies, typically low ones. The spectral filtering and sampling algorithms then require O(KT) 
operations. For using the FFT, the frequencies being excluded are just set to zero. When the 
observation process is low-dimensional, as is the case in our application, one can include an 
incidence matrix i" that relates the process on the grid to the observation locations. Instead 



of (25), the model is then 

w(t i+1 ) = I$a(t i+1 ) + v{t i+1 ), u{t i+1 ) ~ N(0, t 2 1 n ). (27) 

The FFT cannot be used anymore, and the total computational costs are O(K^T) due to the 
traditional FFBS. 



5 Postprocessing Precipitation Forecasts 



Nowadays, numerical weather prediction (NWP) models are capable of producing predictive 
fields at spatially and temporally high frequencies. Statistical postprocessing serves two 
purposes. First, probabilistic predictions are obtained in cases where only deterministic ones 



are available. Further, even if "probabilistic" forecasts in form of ensembles (Palmer 2002 
Gneiting and Raftery l 2005 ) are available, they are typically not calibrated, 



i.e., 



they are 

often underdispersed ( Hamill and Colucci||1997 ). The goal of postprocessing is then to obtain 
calibrated and sharp predictive distributions (see Gneiting et al. (2007a) for a definition 
of calibration and sharpness). In the case of precipitation, the need for postprocessing is 
particularly strong, since, despite their importance, precipitation forecasts are still not as 



accurate as forecasts for other meteorological quantities (Applequist et al. 2002 Stensrud 



and Yussouf 2007). 



Several approaches for postprocessing precipitation forecasts have been proposed, includ- 
ing linear regression (|Antolik|2000D, lo gistic regression ( |Hamill et al.|2004| ) , quantile regression 



(Bj0rnar Bremnes 2004 Friederichs and Hense 2007), hierarchical models based on prior cli 



matic distribution (Krzysztofowicz and Maranzano 



2005), and binning techniques (Yussouf and Stensrud 2006). Sloughter et al. (2007) propose 



2006), neural networks (Ramrez et al. 



a two-stage model to postprocess precipitation forecasts. Berrocal et al. (2008) extended 



the model of Sloughter et al. (2007) by accounting for spatial correlation. Kleiber et al. 



(2011) present a similar model that includes ensemble predictions and accounts for spatial 



correlation. 

Except for the last two references, spatial correlation is typically not modeled in postpro- 
cessing precipitation forecasts, and none of the aforementioned models explicitly accounts for 
spatio-temporal dependencies. However, for temporally and spatially highly resolved data, 
it is necessary to account for correlation in space and time. First, spatio-temporal correla- 
tion is important, for instance, for predicting precipitation accumulation over space and time 
with accurate estimates of precision. Further, it is likely that errors of NWP models exhibit 
structured behaviour over space and time. A spatio-temporal model can capture structured 
dynamics of this error term and extrapolate a spatial error over time. 
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Locations of forecast gridpoints and observation stations 




500 550 600 650 700 750 800 



Figure 5. Locations of grid points at which predictions are obtained (50 x 100 grid of small 
dots) and observations stations (bold dots). Both axis are in km using the Swiss coordinate 
system (CH1903). 



5.1 Data 

We have precipitation predictions from an NWP model called COSMO-2, a high-resolution 
model with a grid spacing of 2.2 km that is run by MeteoSwiss as part of COonsortium for 
Small-scale Modelling (COSMO) (see,e.g., [STeppeler et aL]2003D . The NWP model produces 



deterministic forecasts once a day starting at 0:00UTC. Predictions are made for eight con- 
secutive time periods corresponding to 24 h ahead. In the following, let yp(t,s) denote the 
forecast for time t and site s made at 0:00UTC of the same day. We consider a rectangular 
region in northern Switzerland shown in Figure [5} The grid at which predictions are made is 
of size 50 x 100. Precipitation is observed at 32 stations over northern Switzerland. Figure 
[5] also shows the locations of the observation stations. We use three hourly data from the 
beginning of December 2008 till the end of March 2009. To illustrate the data, in Figure |6j 
observed precipitation at one station and the equally weighted areal average precipitation are 
plotted versus time. We will use the first three months containing 720 time points for fitting, 
and the last month is left aside for evaluation. 

The NWP model forecasts are deterministic and ensembles are not available in our case. 
However, the extension to use an ensemble instead of just one member can be easily done. 
One can include all the ensemble members in the regression part of the model. Or, in the 
case of exchangeable members, one can use the location and the spread of the ensemble. 



5.2 Precipitation Model for Postprocessing 

The model presented in the following is a Bayesian hierarchical model (BHM). It uses the 
spatio-temporal process presented above at the process level. At the data stage, a model 
adapted to the nature of precipitation is used. A characteristic feature of precipitation is that 
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Figure 6. Precipitation versus time, for one station and averaged over all stations. 



its distribution consists of a discrete component, indicating occurrence of precipitation, and a 
continuous one, determining the amount (see Figure [6]). As a consequence, there are two basic 
statistical modeling approaches. The continuous and the discrete part are either modelled 



separately (Coe and Stern 



1982 



Wilks 1999) or together (Bell 1987; Wilks 



1990 



and Plate||T992[|Hutchinson||1995flSans6 and Guenni||2004[ ) . See, e.g., |Sigrist et al 



a more extensive overview of precipitation models and for further details on the data model 



Bardossy 



(2012) for 



used below. Originally, the approach presented in the following goes back to Tobin (1958) 



who analyzed household expenditure on durable goods. For modeling precipitation, 



Stidd 



( 1973 ) took up this idea and modified it by including a power-transformation for the non-zero 

2 depends on a latent Gaussian 



part so that the model can account for skewness. 

It is assumed that rainfall y(t, s) at time t on site s E 
variable w(t, s) through 



y(t,s) = 0, iiw(t,s)<0, 
= w(t,s) x , i£w(t,s)>0, 



(28) 



where A > 0. A power transformation is needed since precipitation amounts are skewed 
and do not follow a truncated normal distribution. The latent Gaussian process w(t,s) is 
interpreted precipitation potential. 

The mean of the Gaussian process w(t, s) is assumed to depend linearly on spatio-temporal 



covariates x(t, s) € 



As shown below, this mean term basically consists of the NWP 



forecasts. Variation that is not explained by the linear term is modeled using the Gaussian 
process s) and the unstructured term i/(t, s) for microscale variability and measurement 
errors. The spatio-temporal process £(t, s) has two functions. First, it captures systematic 
errors of the NWP in space and time and can extrapolate them over time. Second, it accounts 
for structured variability so that the postprocessed forecast is probabilistic and its distribution 
sharp and calibrated. 



To be more specific concerning the covariates, similarly as in Berrocal et al. (2008), we 



include a transformed variable yp(t, s) 1 ^ and an indicator variable l{y F (t iS )=o} which equals 
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1 if yF(t, s) = and otherwise. A is determined by fitting the transformed Tobit model as 



in ( 28 ) to the marginal distribution of the rain data ignoring any spatio-temporal correlation. 



In doing so, we obtain A ~ 1.4. s) 1 ^ is centered around zero by subtracting its overall 

mean y 1 J X in order to reduce posterior correlations. Thus, w(t, s) equals 



w(t,s) = bi (yF(t,s 



»l/A 



_1/A 

Vf 



+ b 2^{y F (t,s)=0} + €(t, S) + U(t, S). 



(29) 



An intercept is not included since the first Fourier term is constant in space. In our case, 
including an intercept term results in weak identifiability which slows down the convergence 
of the MCMC algorithm used for fitting. Note that in situations where the mean is large it is 
advisable to include an intercept, since the coefficient of the first Fourier term is constrained 
by the joint prior on a. Further, unidendifiability is unlikely to be a problem in these cases. 

Concerning the spatio-temporal process £(t,s), we apply padding. This means that we 
embed the 50 x 100 grid in a rectangular 200 x 200 grid. A brief prior investigation showed that 
the range parameters are relatively large in comparison to the spatial domain, and padding is 
therefore used in order to avoid spurious correlations due to periodicity. Further, we use an 



incidence matrix 7" as in (27) to relate the process at the grid to the observation stations. As 



argued in Section 4.2, this then requires that we use a reduced dimensional Fourier expansion. 
I.e., instead of using N = 200 2 basis functions, we only use K « N low-frequency Fourier 
terms. This is done for the following reasons. First, the observation stations are relatively 
scarce which means that there is probably no information on spatial high frequencies of the 
NWP error and, consequently, the low frequencies are important. Further, the covariates, 
i.e., the NWP forecasts, are not available on the extended 200 x 200 domain, which they need 
to be if we do not use an incidence matrix and model the process on the full grid. 

Concerning prior distributions, we assign weakly informative, and in some cases improper, 
priors to the parameters = (po> C> Pii 7j Q > fix, P"y, t 2 )'- b and A have improper, locally 
uniform priors on R and M+, respectively. Based on Gelman (2006), we use improper priors 



for o" 2 and r 2 that are uniform on the standard deviation scale a and r, respectively. Further, 
the drift parameters fj, x and fj, y have uniform priors on [—0.5, 0.5], and a has a uniform prior 
on [0,7r/2]. The anisotropy parameter 7 has a uniform prior on the log scale of the interval 
[0.1, 10]. 7 is restricted to [0.1, 10] since stronger anisotropy does not seem reasonable. The 
range parameters po and p\ as well as the damping parameter £ are assigned improper, locally 
uniform priors on M + . In summary, 



P[b,\,0] oc 



„ 1 {-0.5<^,M !/ <0.5} 1 {0<a<7r/2} 1 {A,po,Pi,C^ 2 ,T2>0} 1 {0.1<7<10}- 
T 7 



5.3 Fitting 

Monte Carlo Markov Chain (MCMC) is used to sample from the posterior distribution. To 



be more specific, we apply what Neal and Roberts (2006) call a Metropolis within-Gibbs 



algorithm which alternates between blocked Gibbs ( Gelfand and Smith||1990 ) and Metropolis 
(Metropolis et al. 1953 Hastings 1970) sampling steps. For the random walk Metropolis 



steps, we use an adaptive algorithm ( Roberts and Rosenthal|2009 ) meaning that the proposal 
covariances are successively estimated such that an optimal scaling is obtained with an ac- 



ceptance rate between 0.2 and 0.3. See Roberts and Rosenthal (2001) for more information 



on optimal scaling for Metropolis-Hastings algorithms. 
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Our goal is to simulate from the posterior distribution [b, \,9,ot,w\y], where y denotes 
the set of all observations. Note that since the latent process £ is the Fourier transform of 
the coefficients a, knowing the posterior of a is equivalent to knowing the one of In the 
following, we use the notation "[y|-]" and "P[y|-]" to denote conditional distributions and 
densities, respectively. A Gibbs step is used to sample from [w\-]. To be more specific, w 
is only sampled for those values with corresponding observations being censored at zero or 
missing, the full conditionals being truncated and regular Gaussian distributions, respectively. 
See Sigrist et al. (2012) for more details on this type of data augmentation approach. Next, 

in a Metropolis-Hastings step. A proposal (b* ,9* ,a*) is 
) from a Gaussian distribution with the mean equaling the 

and then sampling a* 



we sample jointly from [b,9,a\- 
obtained by first sampling (b*,9 

last value and an adaptively estimated proposal covariance matrix 
from using the forward filtering backward sampling (FFBS) (Carter and Kohn 



1994 Fruhwirth-Schnatter||1994" ) . Joint sampling is preferable over iteratively and separately 
sampling from b* , 6*, and a*, since these variables can be strongly dependent which results 
in slow mixing. In particular a and as well as the fixed and random effects determined by b 
and a, respectively, can be strongly dependent. This problem is similar to the one observed 



when doing inference for diffusion models(see, e.g., Roberts and Stramer (2001) and Golightly 
and Wilkinson] ( [2008) ) ). It can be shown that the acceptance ratio for the joint proposal is 



min 1, 



p[b*,e*\w^} 

P[b®,0®\wW] 



(30) 



where P[b, 9\w^'] denotes the likelihood of (b, 9) given the current w^' evaluated at (b,9), 
and where 9* and 9^ denote the proposal and the last values, respectively. We see that 
this acceptance ratio does not depend on a. Thus, the parameters b and 9 are allowed to 
move faster in their parameter space. We recall that the value of the likelihood P[b, 9\w^] is 
obtained as a side product of the Kalman filter in the FFBS. Finally, for sampling from the full 
conditional [A|-], a random walk Metropolis step is used. After a burn-in of 5,000 iterations, 
we use 100, 000 samples from the Markov chain to characterize the posterior distribution. 
Convergence is monitored by inspecting trace plots. 



5.4 Model Selection and Results 

As said, we use a reduced dimensional approach. The number of Fourier functions is deter- 
mined based on predictive performance for the 240 time points that were set aside. We start 
with models including only spatial low-frequencies and add successively higher frequencies. 
In doing so, we only consider models that have the same resolution in each direction. For 
instance, we do not consider models that have higher frequency spatial basis functions in the 
horizontal direction than in the vertical one. 

In order to assess the performance of the predictions and to choose the number of basis 



functions to include, we use the continuous ranked probability score (CRPS) (Matheson and 
Winkler|[l976l ). The CRPS is a strictly proper scoring rule ( |Gneiting and Raftery||2007 ) that 
assigns a numerical value to probabilistic forecasts and assesses calibration and sharpness 



simultaneously (Gneiting et al. 2007a). It is defined as 



CRPS(F,y) 



(F(x) 



t{y< X 



:}) 2 dx, 
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where F is the predictive cumulative distribution, y is the observed realization, and 1 denotes 
an indicator function. If a sample , . . . , y( m ^ from F is available, it can be approximated 
by 

Tfl 771 

" X> ( ° - fl - A (32) 
l i,j=i 

Ideally, one would run the full MCMC algorithm at each time point t > 720, including all 
data up to the point, and obtain predictive distributions from this. Since this is rather time 
consuming, we make the following approximation. We assume that the posterior distribution 
of the "primary" parameters 6, b, and A given y 1:t = {y 1 , . . . , y t } is the same for all t > 720. 
That is, we neglect the additional information that the observations in March give about 
the primary parameters. Thus, the posterior distributions of the primary parameters are 
calculated only once, namely on the dataset from December 2008 to February 2009. This 
assumption that the posterior of the primary parameters does not change with additional 
data may be questionable over longer time periods and when one moves away from the time 
period from which data is used to obtain the posterior distribution. But since all our data 
lies in the winter season, we think that this assumption is reasonable. If longer time periods 
are considered, one could use sliding training windows or model the primary parameters as 
non-stationary using a temporal evolution. 

For each time point t > 720, we make up to 8 steps ahead forecasts corresponding to 
24 hours. I.e., we sample from the predictive distribution of yl +k , k = 1, . . . 8, given y l t = 
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Figure 7. Comparison of different statistical models using the continuous ranked probability 
score (CRPS). On the left are CRPSs of station specific forecasts and on the right are CRPSs 
of areal forecasts. K denotes the number of basis functions used in the model. "Sep" denotes 
the separable model with K = 29. The unit of the CRPS is mm. 

In Figure [7| the average CRPS of the pointwise predictions and the areal predictions are 
shown for the different statistical models. In the left plot, the mean is taken over all stations 
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and lead times, whereas the areal version is an average over all lead times. This is done for the 
models with different numbers of basis functions used as well as for a separable model which 
is obtained by setting /x = and = 0. Models including only a few low-frequency Fourier 
terms perform worse. Then the CRPS decreases successively. The model based on including 
K = 29 Fourier functions performs best. After this, adding higher frequencies results into 
lower predictive performance. We interpret this results in the way that the observation data 
does not allow for resolving high frequencies. The separable model, with K = 29 Fourier 
terms, clearly performs worse than the model with a non-separable covariance structure. 
Based on these findings, we decide to use the model with 29 cosine and sine functions. 

Table 1. Posterior medians and 95 % credible inte rvals. 





Median 


2.5 % 


97.5 % 


Po 


25.4 


18.8 


32.4 


a 2 


0.838 


0.727 


0.994 


c 


0.00655 


0.000395 


0.0156 


Pi 


48.8 


42.1 


57.1 


7 


4.33 


3.34 


6.01 


a 


0.557 


0.49 


0.617 


P.r 


6.73 


0.688 


12.9 




-4.19 


-8.55 


-0.435 


T 2 


0.307 


0.288 


0.327 




0.448 


0.414 


0.481 


1 {y F (t,s)=0} 


-0.422 


-0.5 


-0.344 


X 


1.67 


1.64 


1.7 



Table [T] shows posterior medians as well as 95% credible intervals for the different param- 
eters. Note that the range parameters pq and p\ as well as the drift parameters p x and p y 
have been transformed back from the unit [0, 1] scale to the original km scale. The posterior 
median of the variance a 2 of the innovations of the spatio-temporal process is around 0.8. 
Compared to this, the nugget variance being about 0.3 is smaller. For the innovation range 
parameter po, we obtain a value of about 25 km. And the range parameter pi that controls 
the amount of diffusion or, in other words, the amount of spatio-temporal interaction, is ap- 
proximately 49 km. With 7 and a being around 4 and 0.6, respectively, we observe anisotropy 
in the south-west to north-east direction. This is in line with the orography of the region, as 
the majority of the grid points lies between two mountain ranges: the Jura to the north-west 
and the Alps to the south-east. The drift points to the south-east, both parameters being 
rather small though. Further, the damping parameter £ has a posterior median of about 0.01. 

Next, we compare the performance of the postprocessed forecasts with the ones from 
the NWP model. In addition to the temporal cross-validation, we do the following cross- 
validation in space and time. We first remove six randomly selected stations from the data, 
fit the latent process to the remaining stations, and evaluate the forecasts at the stations left 
out. Concerning the primary parameters, i.e., all parameters except the latent process, we use 
the posterior obtained from the full data including all stations. This is done for computational 
simplicity and since this posterior is not very sensitive when excluding a few stations. Since 
the NWP produces 8 step ahead predictions once a day, we only consider statistical forecasts 
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Table 2. Comparison of NWP model and statistically postprocessed forecasts ('Stat PP') 
using the mean absolute error (MAE). 'Static' denotes the constant forecast obtained by 
using the most recently observed data. The unit of the MAE is mm. 





Stat PP 


NWP 


Static 


Stationwise 


0.359 


0.485 


0.594 


Areal 


0.303 


0.387 


0.489 



starting at 0:00UTC. This is in contrast to the above comparison of the different statistical 
models for which 8 step ahead predictions were made at all time points and not just once for 
each day. We use the mean absolute error (MAE) for evaluating the NWP forecasts. In order 
to be consistent, we also generate point forecasts from the statistical predictive distributions 
by using medians, and then calculate the MAE for these point forecasts. In Table [2j the results 
are reported. For comparison, we also give the score for the static forecast that is obtained 
by using the most recently observed data. The postprocessed forecasts clearly perform better 
than the raw NWP forecasts. In addition, the postprocessed forecasts have the advantage 
that they provide probabilistic forecasts quantifying prediction uncertainty. 

The statistical model produces a joint spatio-temporal predictive distribution that is spa- 
tially highly resolved. To illustrate the use of the model, we show several quantities in Figure 
[8} We consider the time point t = 760 and calculate predictive distributions over the next 24 
hours. Predicted fields for the period t = 761, . . . , 768 from the NWP are shown in the top 
left corner. On the right of it are pointwise medians obtained from the statistical forecasts. 
This is a period during which the NWP predicts too much rainfall compared to the observed 
data (results not shown). The figure shows how the statistical model corrects for this. For 
illustration, we also show one sample from the predictive distribution. To quantify predic- 
tion uncertainty, the difference between the third quartile and the median of the predictive 
distribution is plotted. These plots again show the growing uncertainty with increasing lead 
time. Other quantities of interest (not shown here), that can be easily obtained, include 
probabilities of precipitation occurrence or various quantiles of the distribution. 

6 CONCLUSION 

We have presented a spatio-temporal model and corresponding efficient algorithms for doing 
statistical inference for large data sets. Instead of using the covariance function, we have 
proposed to use a Gaussian process defined trough an SPDE. The SPDE was solved using 
Fourier functions, and we have given a bound on the precision of the approximate solu- 
tion. In the spectral space, one can use computationally efficient statistical algorithms whose 
computational costs grow linearly with the dimension, the total computational costs being 
dominated by the fast Fourier transform. The space-time Gaussian process defined through 
the advection-diffusion SPDE has a nonseparable covariance structure and can be physically 
motivated. The model is applied to postprocessing of precipitation forecasts for northern 
Switzerland. The postprocessed forecasts clearly outperform the raw NWP predictions. In 
addition, they have the advantage that they quantify prediction uncertainty. 

An interesting direction for further research would be to extend the SPDE based model to 
allow for spatial non-stationarity. Whether this can be done using some form of basis function 
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Figure 8. Illustration of postprocessed spatio-temporal precipitation fields for the period 
t = 761, . . . , 768. The figure shows the NWP forecasts (a), pointwise medians of the predictive 
distribution (b), one sample from the predictive distribution (c), and the differences between 
the third quartile and the median of the predictive distribution (d). All quantities are in mm. 
Note that the scales are different in different figures. 
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expansion approach or by relying on a discretization method such as finite differences or finite 
volumes remains an open question. Since the operators of the SPDE are local, one can define 



the SPDE on general manifolds and, in particular, on the sphere (see, e.g., Lindgren et al. 



(2011)). It would be interesting to investigate to which extent spectral methods can still be 
used. 
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